Word Occurrences
Once the single-nucleotide frequencies are known, it is possible to calculate the
expectations of the frequencies of $n$-grams assembled by random juxtaposition. Constraints
on the assembly are revealed by deviations of the actual frequencies from
the expected values. This is the principle of the determination of dinucleotide bias. It
is, however, limited with regard to the inferences that may be drawn. For one thing,
as $n$ increases, the statistics become very poor. The genome of E. coli, for example,
is barely large enough to contain a single example of every possible 11-gram (there are
$4^{11} \approx 4.2 \times 10^{6}$ of them) even if each one were deliberately included.
Furthermore, the comparison of actual frequencies with expected ones depends on the model used to calculate the expected
frequencies. All higher order correlations are subsumed into a single number, from
which little can be said about the relative importance of a particular sequence.
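The comparison of observed with expected frequencies can be sketched in a few lines of Python. This is only an illustration, assuming overlapping $n$-grams counted along a linear sequence and an expectation given by the product of the single-nucleotide frequencies; the function names (`ngram_frequencies`, `expected_frequency`, `bias`) and the toy sequence are hypothetical and not taken from the text.

```python
from collections import Counter
from math import prod


def ngram_frequencies(seq, n):
    """Relative frequencies of the overlapping n-grams found in seq."""
    total = len(seq) - n + 1
    counts = Counter(seq[i:i + n] for i in range(total))
    return {gram: c / total for gram, c in counts.items()}


def expected_frequency(gram, base_freq):
    """Expected frequency of gram under random juxtaposition, i.e. the
    product of the single-nucleotide frequencies (an independence model)."""
    return prod(base_freq[b] for b in gram)


def bias(seq, n=2):
    """Ratio observed/expected for every n-gram observed in seq.
    Values far from 1 point to constraints on the assembly."""
    base_freq = ngram_frequencies(seq, 1)
    observed = ngram_frequencies(seq, n)
    return {g: f / expected_frequency(g, base_freq)
            for g, f in observed.items()}


if __name__ == "__main__":
    toy = "ACGT" * 4  # toy sequence, for illustration only
    for gram, ratio in sorted(bias(toy, 2).items()):
        print(gram, round(ratio, 3))
```

With $n = 2$ the ratio is one common way of expressing a dinucleotide bias. The sketch also makes the statistical limitation visible: the number of possible $n$-grams grows as $4^n$, so for large $n$ most counts are 0 or 1 and the ratios become unreliable.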
It is possible to approach this problem more objectively (according to a maximum
entropy principle 23) by asking what is the most probable continuation of a given
$n$-gram (cf. Eq. 6.21). Frequency dictionaries may be reconstructed from thinner ones
according to this principle; for example, if one wishes to reconstruct the dictionary
$W_n$ from $W_{n-1}$, the reconstructed frequencies are 24

$$\tilde{f}_{i_1,\ldots,i_n} = \frac{f_{i_1,\ldots,i_{n-1}}\, f_{i_2,\ldots,i_n}}{f_{i_2,\ldots,i_{n-1}}}, \tag{17.4}$$

where $i_1, \ldots$ are the successive nucleotides in the $n$-gram. The reconstructed dictionary
is denoted by $\widetilde{W}_n(n-1)$. The most unexpected, and hence informative, $n$-grams
are then those with the biggest differences between the real and reconstructed
frequencies (i.e., with values of the ratio $f/\tilde{f}$ significantly different from unity).
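A minimal sketch of the reconstruction in Eq. 17.4 follows. It makes several assumptions that are not in the text: all dictionaries are counted directly from the same linear sequence (rather than derived from $W_{n-1}$ alone), the empty word is given frequency 1 so that the case $n = 2$ reduces to the product of single-nucleotide frequencies, and "unexpectedness" is ranked by the absolute value of $\log(f/\tilde{f})$. The function names are hypothetical.

```python
import math
from collections import Counter


def frequency_dictionary(seq, n):
    """Relative frequencies of overlapping n-grams in seq.
    Assumption: the empty word (n = 0) has frequency 1."""
    if n == 0:
        return {"": 1.0}
    total = len(seq) - n + 1
    counts = Counter(seq[i:i + n] for i in range(total))
    return {w: c / total for w, c in counts.items()}


def reconstruct(seq, n):
    """Reconstructed frequencies (Eq. 17.4) for each n-gram observed in seq,
    built from the (n-1)-gram and (n-2)-gram frequencies. Requires n >= 2."""
    f_n = frequency_dictionary(seq, n)
    f_n1 = frequency_dictionary(seq, n - 1)
    f_n2 = frequency_dictionary(seq, n - 2)
    return {w: f_n1[w[:-1]] * f_n1[w[1:]] / f_n2[w[1:-1]] for w in f_n}


def most_unexpected(seq, n, top=5):
    """Observed n-grams ranked by how far f / f~ lies from unity.
    n-grams that never occur (f = 0) are not considered in this sketch."""
    f = frequency_dictionary(seq, n)
    f_tilde = reconstruct(seq, n)
    ratios = {w: f[w] / f_tilde[w] for w in f}
    return sorted(ratios.items(),
                  key=lambda kv: abs(math.log(kv[1])),
                  reverse=True)[:top]


if __name__ == "__main__":
    toy = "ACGT" * 4  # toy sequence, for illustration only
    for word, ratio in most_unexpected(toy, 3):
        print(word, round(ratio, 4))
```

Ranking on the logarithm of the ratio treats over- and under-representation symmetrically; that choice is a convenience of this sketch rather than something prescribed by the source.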
17.7 Phylogenies
The notion that life-forms evolved from a single common ancestor (i.e., that the
history of life is a tree) is pervasive in biology. 25 Before gene and protein sequences
became available, trees were constructed from the externally observable character-
istics of organisms. Each organism is therefore represented by a point in phenotype
space. In the simplest (binary) realization, a characteristic is either absent (0) or
present (1) or is present in either a primitive (0) or an evolved (1) form. The distance
23 The entropy of a frequency dictionary is defined as
$$S_n = -\sum_{j=1} f_j \log f_j . \tag{17.3}$$
24 Gorban et al. (2000).
25 The concept of phylogeny was introduced by E. Haeckel; see Sect. 14.9.